Clustering and Association

The techniques we discuss in this lecture involve detecting similarities between observations in a data set, generally with no a priori assumptions about what characteristics should drive the search for similarities.

Clustering analyses such as these are sometimes referred to as unsupervised learning because the algorithm is not given instructions about which characteristics to use to detect similarity; it is left to identify those characteristics on its own.

This family of algorithms is especially helpful when:

- Not much is known about some phenomenon or its associated characteristics.
- You want to run a classification analysis on some data but do not know where to start.

Within this category of methods, a few choices are especially popular for cluster analysis - these include principal components analysis and k-means. Today, we will review each, beginning with the former.

Principal Components Analysis

One of the more typical applications of principal components analysis in the social sciences is variable reduction.

As an example, let’s say that we are trying to predict recidivism of an individual who has just left prison and we have multiple variables describing their characteristics, including a pretty exhaustive set of attitude scores we know to be associated with new criminal behavior. In this example, let’s suppose we have ten such scores.

The problem is that those ten scores are highly correlated with one another. If I include all of them in a regression model, chances are high that there will be very little unique variation left in each to explain the outcome. Instead, I would use principal components analysis to compute what is known as a factor score.

What principal components analysis technically does is assume that there is a p-dimensional space (one dimension for each variable) in which values on the variables cluster together. The algorithm then simplifies this space by reducing it to the dimensions along which observations vary the most.
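A minimal sketch of this idea, using simulated data rather than anything from the lecture: two noisy measures of the same underlying trait should collapse onto a single dominant component.

```r
# Two highly correlated measures of one unobserved trait
set.seed(42)
latent <- rnorm(100)                     # the unobserved trait
score1 <- latent + rnorm(100, sd = 0.2)  # noisy measure #1
score2 <- latent + rnorm(100, sd = 0.2)  # noisy measure #2
toy <- data.frame(score1, score2)

toy_pca <- prcomp(toy, scale = TRUE)
summary(toy_pca)  # PC1 should account for the vast majority of the variance
```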

Simpson’s 3-Dimensional Space

This process typically results in one or more underlying latent factors (though usually no more than three or so are very interesting). These latent factors represent some concept that ties specific attitudes together such that individuals high on one score also tend to be high or low on other scores.

After we have detected these factors, we can then replace the ten original attitude scores in the models with the reduced number of factors, which now has fewer highly correlated variables (and, therefore, less severe multicollinearity issues).

Variable reduction is very helpful (and recommended) in scenarios where you have a lot of information (variables) about units of analysis, but few of the units themselves to analyze. Generally speaking, you want some large number of units per variable in a model (the number 30 is arbitrarily thrown around) and this number will depend on the unique values of the variable itself.

Let’s begin with an example using some data a colleague and I put together for a project about violent crime and state expenditures.

library(haven)  # provides read_dta() for reading Stata files

state_data <- as.data.frame(read_dta("state_data.dta"))
state_data <- state_data[complete.cases(state_data),]  # drop rows with missing values
state_names <- state_data$state
state_data[,1] <- NULL  # remove the state identifier before the PCA
str(state_data)
## 'data.frame':    637 obs. of  28 variables:
##  $ rmurd    : num  6.2 5.1 6 5.6 4.8 5.4 6.3 3.9 3.1 4.3 ...
##  $ rrape    : num  79.1 79.7 93.3 84.8 81.1 76.4 79.7 65.1 73.4 74.6 ...
##  $ rrobb    : num  81.1 76.2 68.8 68 81 89.5 85 94 93.6 83.2 ...
##  $ raggr    : num  423 404 430 474 465 ...
##  $ rburg    : num  607 609 598 574 623 ...
##  $ rlarc    : num  2635 2765 2784 2457 2601 ...
##  $ rauto    : num  413 385 380 341 391 ...
##  $ poverty  : num  8.5 8.8 9.6 9.1 10 8.9 7.6 8.2 11.7 12.5 ...
##  $ gini     : num  0.565 0.572 0.581 0.59 0.589 ...
##  $ top_1    : num  3.34 3.53 4.21 4.54 4.94 ...
##  $ avgincome: num  51321 51799 52737 54717 59310 ...
##  $ cpi2014  : num  1.34 1.32 1.29 1.25 1.21 ...
##  $ urate    : num  6.43 7.3 7.82 7.45 6.89 ...
##  $ gdp      : num  28553 29763 32039 35195 40063 ...
##  $ netearn  : num  22631 23877 25272 26376 27972 ...
##  $ uinsur   : num  174 248 257 187 161 143 133 162 308 379 ...
##  $ retire   : num  4443 4401 4234 4142 4352 ...
##  $ avwage   : num  37430 38575 39515 40778 42440 ...
##  $ txsocl   : num  506258 545718 575664 618504 659726 ...
##  $ txwork   : num  14910 16988 16084 15844 18152 ...
##  $ txmdcr   : num  239757 266272 294149 357882 383685 ...
##  $ txmdcd   : num  634161 760424 898576 916083 1034714 ...
##  $ txsupp   : num  94117 100720 103264 107078 109814 ...
##  $ txeitc   : num  37837 41331 52011 55733 59880 ...
##  $ txsnap   : num  47920 60330 61464 71514 81477 ...
##  $ txothr   : num  113447 118107 117303 119479 130189 ...
##  $ txuins   : num  105371 153391 160223 118566 103104 ...
##  $ inc_rate : num  7.21 6.85 6.98 6.91 7.21 ...

Lots of information here! To put it simply, the typical index crimes are measured here, but we also focused on social assistance programs and various economic indicators. The ultimate question is: do states with more generous social assistance programs and healthier economies also have lower crime rates (property and violent)?

Let’s run some PCA and see how the states tend to cluster:

pca_out <- prcomp(state_data, scale=TRUE, rank=4)
pca_out$center
##            rmurd            rrape            rrobb            raggr 
##        4.5097331       33.7799058       99.8235479      253.0142857 
##            rburg            rlarc            rauto          poverty 
##      676.4762951     2184.3558870      293.4811617       12.5952904 
##             gini            top_1        avgincome          cpi2014 
##        0.5975789        8.1272836    49857.3552020        1.1623567 
##            urate              gdp          netearn           uinsur 
##        6.0648022   274478.8398744    24442.0172684      199.7566719 
##           retire           avwage           txsocl           txwork 
##     5170.1381476    40264.0313972 11833840.4160126   302501.6938776 
##           txmdcr           txmdcd           txsupp           txeitc 
##  8287612.9670330  6916423.7912088   858770.9105181   924471.7959184 
##           txsnap           txothr           txuins         inc_rate 
##   848072.4442700  1248979.9623234  1289301.4583987        4.2708164
pca_out$scale
##             rmurd             rrape             rrobb             raggr 
##        2.31731914       12.10379396       56.04557875      122.90670747 
##             rburg             rlarc             rauto           poverty 
##      231.67220374      494.10015722      169.27786911        3.31576325 
##              gini             top_1         avgincome           cpi2014 
##        0.03644016        3.36075566     9692.23131430        0.10571304 
##             urate               gdp           netearn            uinsur 
##        2.06800724   335703.27112829     5197.85347248      142.91785962 
##            retire            avwage            txsocl            txwork 
##     1254.48958390     7426.33876008 12252334.80275453   560343.40109421 
##            txmdcr            txmdcd            txsupp            txeitc 
##  9883068.44523843  9363792.89533899  1339138.87226810  1186237.35862496 
##            txsnap            txothr            txuins          inc_rate 
##  1119195.69810643  1798132.13279835  2075257.56321031        1.62631655

I specify rank=4 to keep just the top four principal components. This is mostly for the sake of simplicity in this example; otherwise the number of components is much larger. Practically speaking, though, the components beyond that number explained fairly little variance, so they would have been cut off later on anyhow.

I also specify scale=TRUE. What this means is that not only are all numeric variables centered at their mean, but they are also standardized to have a standard deviation of one (what else has a mean of zero and a standard deviation of one?).

I do this because these variables are measured on various scales and some have very large variances whereas others have low variances. The problem with large differences in variance is that the PCA is driven by explaining the variance in some underlying latent variable - larger variances will overwhelm this process and weight the results so that variables with larger variances have higher loadings onto the latent components.

Simply put, standardize your numeric variables before using PCA, otherwise the results will favor variables with larger variances.
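A hypothetical illustration of that point, with made-up data: a variable with a huge variance dominates an unscaled PCA, while scale=TRUE puts both variables on an equal footing.

```r
# One small-variance and one large-variance variable
set.seed(7)
dat <- data.frame(
  small = rnorm(200, sd = 1),    # variance around 1
  big   = rnorm(200, sd = 1000)  # variance around 1,000,000
)

unscaled <- prcomp(dat, scale = FALSE)
scaled   <- prcomp(dat, scale = TRUE)

unscaled$rotation  # PC1 is essentially just 'big' (loading near +/-1)
scaled$rotation    # both variables now contribute comparably
```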

Okay, now that we have the PCA estimated, let’s show the rotation matrix to see which variables load onto which component:

pca_out$rotation
##                   PC1           PC2         PC3          PC4
## rmurd     -0.07844698 -0.3163596146  0.14592396 -0.197669694
## rrape      0.08626807 -0.1107065534  0.11701300 -0.147402169
## rrobb     -0.16069486 -0.2264202655 -0.04918335 -0.273973668
## raggr     -0.04517282 -0.2955412866  0.12874537 -0.271778138
## rburg     -0.02110026 -0.3618273201  0.17956615 -0.049808623
## rlarc      0.05932396 -0.3262651258 -0.06314208 -0.071880487
## rauto     -0.02117866 -0.2870733051 -0.19763474 -0.219737521
## poverty   -0.09344289 -0.1856069506  0.38030083  0.177022456
## gini      -0.15341625 -0.0048641775  0.16206981 -0.132180059
## top_1     -0.13630960  0.0614725222 -0.10167101 -0.315588235
## avgincome -0.09022427  0.2606179171 -0.11248701 -0.431644979
## cpi2014    0.10665329 -0.2168776222 -0.37114195  0.112987236
## urate     -0.14744619  0.0174381745  0.38913133  0.109798774
## gdp       -0.29314038 -0.0406749315 -0.15643324  0.048279255
## netearn   -0.10811234  0.2862834902 -0.05955137 -0.381693374
## uinsur    -0.11366418  0.1773077608  0.23408630  0.035753288
## retire    -0.10495492  0.2020985885  0.40360146  0.009777945
## avwage    -0.20212276  0.2070921022  0.07046539 -0.311702400
## txsocl    -0.29483513 -0.0473372065 -0.06658360  0.082196966
## txwork    -0.20849385 -0.0213044621 -0.18722735  0.102746759
## txmdcr    -0.29819370 -0.0356785139 -0.05223487  0.070993059
## txmdcd    -0.29017632 -0.0082301031 -0.11240279  0.083224548
## txsupp    -0.28285410 -0.0528339403 -0.12576878  0.111636986
## txeitc    -0.28148007 -0.1000415026 -0.03175917  0.084738545
## txsnap    -0.28191637 -0.0302022157  0.06904459  0.107734974
## txothr    -0.29357749  0.0005244102 -0.12691342  0.081593336
## txuins    -0.26439106  0.0346141054 -0.01987672  0.128492734
## inc_rate  -0.03593154 -0.2408926635  0.20989169 -0.211063151

From the looks of it, the first factor has to do with how a state spends its tax dollars, the second mostly concerns the crime rate, the third is a mixture of poverty and other income indicators, and the last component is a blend of crime rates and income.

Now, let’s look at a biplot for the first two factors:

biplot(pca_out, scale=0, cex=c(0.3,0.6), cex.axis=0.6)
abline(h=0, col="grey", lty=4)  # dashed reference line at y = 0
abline(v=0, col="grey", lty=4)  # dashed reference line at x = 0

A little messy, but we can see how the individual variables load into the two factors. The dashed lines represent scores of 0 on either component or, for the secondary axis, loadings onto either factor.

Interpreting all of these results is generally pretty tricky, as there are also more than two factors from this analysis. But, an example or two is helpful:

- The avwage (average wages) variable loads positively on component 2 but negatively on component 1 - it’s about a magnitude of 0.2 for both.
- By contrast, the poverty rate (poverty) loads negatively onto both components; about 0.1 for component 1 and 0.2 for component 2.

We will also want to know how much of the total variance across all four components is explained by each. We can calculate estimates for this by computing the variance for each component then dividing it by the sum of these variances.

pca_variance <- pca_out$sdev^2
pca_explained <- pca_variance/sum(pca_variance)
pca_explained
##  [1] 0.3660829533 0.1816940951 0.1050477463 0.0881976197 0.0539212199
##  [6] 0.0358713052 0.0300284508 0.0253825312 0.0220457499 0.0188042208
## [11] 0.0120471120 0.0102357925 0.0078773972 0.0075363566 0.0066915926
## [16] 0.0055040875 0.0048504007 0.0038584007 0.0036871742 0.0031557929
## [21] 0.0022441882 0.0015315483 0.0011238433 0.0008923340 0.0007424312
## [26] 0.0005345093 0.0002423109 0.0001688359

The explained variance is still printed for all of the components the PCA identified, but we are only interested in the first four. We can see here that the first component explained a little under 37% of the total variance across the components, the second explained about 18%, the third about 10.5%, and the fourth about 9%. There are definitely diminishing returns with each subsequent component!

This makes sense, because the components are uncorrelated with one another. By design, the PCA computes the first component to maximize explained variance, then moves on to the variance that CANNOT be explained by the first component to maximize explained variance for the second component, and so on…
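That lack of correlation is easy to verify: component scores from prcomp() are uncorrelated by construction, so their correlation matrix is the identity up to rounding error. A quick sketch with simulated data:

```r
# PCA on arbitrary simulated data
set.seed(1)
dat <- as.data.frame(matrix(rnorm(300), ncol = 3))
pca <- prcomp(dat, scale = TRUE)

# Correlations between component scores: 1s on the diagonal, 0s elsewhere
round(cor(pca$x), 10)
```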

One final note, we can extract scores on these components from the x vector within the pca_out object, like so:

factors <- data.frame(pca_out$x)
str(factors)
## 'data.frame':    637 obs. of  4 variables:
##  $ PC1: num  2.96 2.78 2.64 2.48 2.28 ...
##  $ PC2: num  -2.08 -1.62 -1.77 -1.26 -1.16 ...
##  $ PC3: num  0.0817 0.3642 0.8507 0.7033 0.6841 ...
##  $ PC4: num  -1.45 -1.37 -1.88 -2.11 -2.63 ...

If you use the cbind() function you can then attach these scores as new columns of the state_data data frame to use later on. Note that each value within each column represents that observation’s score on that component.

In practice, I recommend putting the variable you want to run a factor analysis on into a separate data frame, running the PCA, then adding the factor score vectors back into the original data frame.
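A sketch of that workflow, with made-up column names: pull the variables you want to reduce into their own data frame, run the PCA, then attach the score columns back with cbind().

```r
# Hypothetical data frame with an outcome and three attitude measures
set.seed(2)
full_df <- data.frame(outcome = rnorm(50),
                      att1 = rnorm(50), att2 = rnorm(50), att3 = rnorm(50))

attitudes <- full_df[, c("att1", "att2", "att3")]  # variables to reduce
att_pca   <- prcomp(attitudes, scale = TRUE, rank = 2)

full_df <- cbind(full_df, att_pca$x)  # adds PC1 and PC2 as new columns
str(full_df)
```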

A Final Note on Eigenvalues and Variable Reduction

If you search for “factor analysis” on Google you will undoubtedly come across the terms eigenvector and eigenvalue. We don’t have time to discuss the intricacies of each, but I want to provide some direction on the latter.

An eigenvalue measures how much of the total variance a component captures, while the eigenvectors are the columns of the rotation matrix we examined earlier. The square roots of the eigenvalues are the standard deviations stored by the prcomp() function. You can display these like so:

pca_out$sdev
##  [1] 3.20161251 2.25553423 1.71503262 1.57147490 1.22873681 1.00219586
##  [7] 0.91694963 0.84303670 0.78567232 0.72561573 0.58079182 0.53535240
## [13] 0.46964574 0.45936694 0.43285632 0.39257413 0.36852574 0.32868711
## [19] 0.32131119 0.29725780 0.25067363 0.20708296 0.17739113 0.15806756
## [25] 0.14418070 0.12233667 0.08236932 0.06875613

Where each value represents the standard deviation of the nth principal component in these data - that is, the square root of that component’s eigenvalue. If we were to square these values, we would obtain the eigenvalues, which are typically used to decide which components to retain. As a general rule, eigenvalues above 1.00 are considered worthwhile to retain as factors.

Let’s look at the eigenvalues in these data now - we can obtain these values by squaring the sdev vector from the stored PCA analysis:

pca_out$sdev^2
##  [1] 10.250322691  5.087434662  2.941336897  2.469533352  1.509794156
##  [6]  1.004396546  0.840796623  0.710710874  0.617280996  0.526518181
## [11]  0.337319137  0.286602190  0.220567121  0.211017986  0.187364592
## [16]  0.154114449  0.135811221  0.108035219  0.103240878  0.088362201
## [21]  0.062837269  0.042883352  0.031467612  0.024985353  0.020788073
## [26]  0.014966262  0.006784704  0.004727405

So it looks like we could decide to keep the top five factors instead of the top four because the fifth eigenvalue is about 1.51. The sixth is only slightly above one, so I don’t think it’s prudent to keep that one.

If you were to apply this to your own analysis, you would instead keep the first five factors by changing the rank= option to rank=5. You could then examine the loadings onto each component and rename each factor to something meaningful given its loading pattern.

The benefit to this approach is that you could reduce your number of variables from some large number - in this example, 28, to a much smaller value (around five, depending on factor loadings).

This simplifies a regression model and is especially useful when you do not have very many observations or have to choose from a very large number of variables.
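The eigenvalue-above-one rule can also be applied programmatically. A sketch with simulated data (so the retained number will vary from run to run):

```r
# PCA on 10 simulated variables
set.seed(3)
dat <- as.data.frame(matrix(rnorm(1000), ncol = 10))
pca <- prcomp(dat, scale = TRUE)

eigenvalues <- pca$sdev^2     # squared standard deviations
k <- sum(eigenvalues > 1)     # number of components worth retaining
k
```

Note that with scaled variables the eigenvalues always sum to the number of variables, so this rule amounts to keeping components that explain more than one variable's worth of variance.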

k-Means Clustering

The k-means algorithm attempts to identify some number of clusters in the data, where k is that number, by grouping each observation with whichever of the k cluster center points it lies closest to.

That description is fairly abstract and requires some additional explanation. The process is somewhat similar in spirit to PCA, but in this case we identify clusters of observations rather than factors tying variables together.

Whereas PCA expressly calculates factors based upon variance, the k-means algorithm calculates clustering based upon distance from a centroid, which is itself not very dissimilar from a variance.

The chief difference is that the k-means algorithm expects you to provide k, while PCA will compute as many factors as are evident in the data. Both methods are inherently finding more homogeneous groups within heterogeneous data, but often for different purposes.
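To make the "distance from a centroid" idea concrete, here is a bare-bones sketch of one assignment/update pass of the k-means idea (Lloyd's algorithm) on simulated data; kmeans() simply repeats these two steps until nothing changes.

```r
# 20 simulated observations on 2 variables, k = 2
set.seed(4)
pts <- matrix(rnorm(40), ncol = 2)
centroids <- pts[sample(nrow(pts), 2), ]  # two random starting centers

# Assignment step: squared Euclidean distance of each point to each centroid
dists <- sapply(1:2, function(j)
  rowSums((pts - matrix(centroids[j, ], nrow(pts), 2, byrow = TRUE))^2))
assignment <- max.col(-dists)             # index of the nearest centroid

# Update step: each centroid moves to the mean of its assigned points
centroids <- rbind(colMeans(pts[assignment == 1, , drop = FALSE]),
                   colMeans(pts[assignment == 2, , drop = FALSE]))
```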

In terms of explaining what k-means does, it’s easiest in graph format. I’ll keep using the state_data data frame but I want to cut out some variables for the sake of simplicity. I’ll use the subset function and the select option to do so:

state_data_red <- subset(state_data, 
                         select=c("inc_rate", "poverty"))

Now, I can run the kmeans function on the reduced data frame using an initial number of 3 clusters (chosen arbitrarily!) like so:

set.seed(03151813)
kmeans_out <- kmeans(state_data_red, 3)

First, we will take a look at the numeric results:

kmeans_out
## K-means clustering with 3 clusters of sizes 227, 106, 304
## 
## Cluster means:
##   inc_rate   poverty
## 1 4.426259 13.774890
## 2 5.635049 18.037736
## 3 3.679059  9.816776
## 
## Clustering vector:
##  16  17  18  19  20  21  22  23  24  25  26  27  28  31  32  33  34  35  36  37 
##   3   3   3   3   3   3   3   3   1   1   1   3   3   1   1   1   1   1   1   1 
##  38  39  40  41  42  43  46  47  48  49  50  51  52  53  54  55  56  57  58  61 
##   2   2   2   2   2   2   2   2   2   1   1   2   1   1   2   1   2   2   2   1 
##  62  63  64  65  66  67  68  69  70  71  72  73  76  77  78  79  80  81  82  83 
##   1   1   1   1   1   1   1   1   2   2   1   1   3   3   3   3   3   3   3   3 
##  84  85  86  87  88  91  92  93  94  95  96  97  98  99 100 101 102 103 106 107 
##   1   1   1   1   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3 
## 108 109 110 111 112 113 114 115 116 117 118 121 122 123 124 125 126 127 128 129 
##   3   3   3   3   3   3   1   1   1   1   1   1   1   1   3   3   3   1   1   1 
## 130 131 132 133 136 137 138 139 140 141 142 143 144 145 146 147 148 151 152 153 
##   2   1   1   1   1   3   1   1   1   1   1   1   2   2   2   2   2   3   3   3 
## 154 155 156 157 158 159 160 161 162 163 166 167 168 169 170 171 172 173 174 175 
##   3   3   3   3   3   1   1   1   1   3   3   3   3   3   3   3   3   1   1   1 
## 176 177 178 181 182 183 184 185 186 187 188 189 190 191 192 193 196 197 198 199 
##   1   1   1   3   1   1   1   3   3   3   1   1   1   1   1   1   3   3   3   3 
## 200 201 202 203 204 205 206 207 208 211 212 213 214 215 216 217 218 219 220 221 
##   1   3   1   1   2   2   1   1   3   3   3   3   3   3   3   3   3   3   3   3 
## 222 223 226 227 228 229 230 231 232 233 234 235 236 237 238 241 242 243 244 245 
##   3   3   3   3   3   3   1   1   3   1   1   1   1   1   1   1   1   1   2   1 
## 246 247 248 249 250 251 252 253 256 257 258 259 260 261 262 263 264 265 266 267 
##   2   1   2   2   2   2   2   2   2   2   2   2   2   2   2   2   1   2   2   2 
## 268 271 272 273 274 275 276 277 278 279 280 281 282 283 286 287 288 289 290 291 
##   2   3   1   3   3   1   3   3   3   3   1   1   1   1   3   3   3   3   3   3 
## 292 293 294 295 296 297 298 301 302 303 304 305 306 307 308 309 310 311 312 313 
##   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3 
## 316 317 318 319 320 321 322 323 324 325 326 327 328 331 332 333 334 335 336 337 
##   3   3   3   1   1   1   3   1   1   1   1   1   1   3   3   3   3   3   3   3 
## 338 339 340 341 342 343 346 347 348 349 350 351 352 353 354 355 356 357 358 361 
##   3   3   3   3   3   3   2   2   2   2   2   2   2   2   2   2   2   2   2   3 
## 362 363 364 365 366 367 368 369 370 371 372 373 376 377 378 379 380 381 382 383 
##   3   3   1   1   3   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
## 384 385 386 387 388 391 392 393 394 395 396 397 398 399 400 401 402 403 406 407 
##   1   1   2   1   1   3   3   3   3   3   3   3   3   3   3   3   1   3   3   3 
## 408 409 410 411 412 413 414 415 416 417 418 421 422 423 424 425 426 427 428 429 
##   3   3   3   3   3   3   1   2   1   1   2   3   3   3   3   3   3   3   3   3 
## 430 431 432 433 436 437 438 439 440 441 442 443 444 445 446 447 448 451 452 453 
##   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   2   2   2 
## 454 455 456 457 458 459 460 461 462 463 466 467 468 469 470 471 472 473 474 475 
##   2   2   2   1   2   2   2   2   2   2   1   1   1   1   1   1   1   1   1   1 
## 476 477 478 481 482 483 484 485 486 487 488 489 490 491 492 493 496 497 498 499 
##   1   2   1   1   1   1   1   1   1   1   1   2   2   1   2   2   1   3   3   3 
## 500 501 502 503 504 505 506 507 508 511 512 513 514 515 516 517 518 519 520 521 
##   3   3   3   3   3   1   3   3   3   3   3   3   3   1   1   1   1   1   1   1 
## 522 523 526 527 528 529 530 531 532 533 534 535 536 537 538 541 542 543 544 545 
##   1   1   1   1   1   3   2   1   1   1   1   2   1   2   1   3   3   1   3   1 
## 546 547 548 549 550 551 552 553 556 557 558 559 560 561 562 563 564 565 566 567 
##   3   1   3   1   1   1   1   1   3   3   3   3   3   3   3   3   3   1   1   1 
## 568 571 572 573 574 575 576 577 578 579 580 581 582 583 586 587 588 589 590 591 
##   1   3   3   3   3   1   3   3   1   1   1   1   1   1   1   1   1   1   1   3 
## 592 593 594 595 596 597 598 601 602 603 604 605 606 607 608 609 610 611 612 613 
##   1   1   1   2   2   2   1   3   3   1   1   1   3   3   1   1   1   1   1   3 
## 616 617 618 619 620 621 622 623 624 625 626 627 628 631 632 633 634 635 636 637 
##   1   1   1   1   1   1   1   1   2   2   2   2   2   1   2   2   2   2   2   2 
## 638 639 640 641 642 643 646 647 648 649 650 651 652 653 654 655 656 657 658 661 
##   2   2   2   2   2   2   3   3   3   3   3   3   3   3   3   3   3   3   3   3 
## 662 663 664 665 666 667 668 669 670 671 672 673 676 677 678 679 680 681 682 683 
##   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3 
## 684 685 686 687 688 691 692 693 694 695 696 697 698 699 700 701 702 703 706 707 
##   3   3   3   3   3   3   3   1   3   3   3   3   3   3   3   1   3   3   1   2 
## 708 709 710 711 712 713 714 715 716 717 718 721 722 723 724 725 726 727 728 729 
##   2   1   1   1   1   1   1   2   2   2   2   3   3   3   1   3   3   3   3   3 
## 730 731 732 733 736 737 738 739 740 741 742 743 744 745 746 747 748 
##   3   1   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3 
## 
## Within cluster sum of squares by cluster:
## [1]  640.4074  641.4106 1280.9639
##  (between_SS / total_SS =  70.5 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

The first part of these results show the variable averages for the different clusters. It’s helpful to compare these to the overall sample averages:

kmeans_out$centers
##   inc_rate   poverty
## 1 4.426259 13.774890
## 2 5.635049 18.037736
## 3 3.679059  9.816776
summary(state_data_red)
##     inc_rate        poverty    
##  Min.   :1.325   Min.   : 5.4  
##  1st Qu.:3.192   1st Qu.:10.0  
##  Median :4.103   Median :12.2  
##  Mean   :4.271   Mean   :12.6  
##  3rd Qu.:5.031   3rd Qu.:14.8  
##  Max.   :8.856   Max.   :23.1

We see that Cluster 1 has a slightly above average value for the incarceration rate and a somewhat above average poverty rate, Cluster 2 has well above average values for both the incarceration rate and the poverty rate, and Cluster 3 has below average values on both.

The overall picture I see here is that the relationship between the poverty rate and incarceration rate is positive - higher values of one variable tend to correspond to higher values of the other.

Let’s now take a look at the plot of these clusters, with the assigned cluster representing the color of each dot:

plot(state_data_red, col=(kmeans_out$cluster +1), # +1 to suppress default color
     main="K-Means Clustering of Reduced State Data with K=3",
     pch=20, cex=2)

It looks like the groups are fairly separate in two-dimensional space, but there is definitely some overlap on both the x and y-axis.

We may want to assess the performance of this model by having the algorithm try multiple random starting assignments and keeping the solution with the smallest within-cluster sum-of-squares.

We focus on the within-cluster sum-of-squares because it indicates how much variation there is within each cluster. We want this value to be as small as possible, because that means observations within a cluster are more similar to each other than they are to observations in other clusters.
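It may help to see exactly what that quantity is: tot.withinss is just the summed squared distance of every point to its own cluster's centroid, which we can reproduce by hand on simulated data.

```r
# k-means on simulated data, then recompute tot.withinss manually
set.seed(5)
dat <- matrix(rnorm(60), ncol = 2)
km <- kmeans(dat, centers = 2, nstart = 10)

by_hand <- sum(sapply(1:2, function(j) {
  members <- dat[km$cluster == j, , drop = FALSE]   # points in cluster j
  sum(sweep(members, 2, km$centers[j, ])^2)         # squared distances to center
}))
all.equal(by_hand, km$tot.withinss)  # TRUE
```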

We can do this by assigning different values to the nstart= option within the kmeans() function like so:

kmeans1 <- kmeans(state_data_red, 3, nstart=1)
kmeans1$tot.withinss
## [1] 2562.782

That’s a really high value! There’s no universal standard for what the value should be, as it depends on how the variables are measured, but both the poverty rate and the incarceration rate take fairly small values, so a within-cluster sum-of-squares this high tells us that there is a lot of within-cluster variation.

Let’s increase the nstart value and see if we can minimize that a bit:

kmeans2 <- kmeans(state_data_red, 3, nstart=50)
kmeans2$tot.withinss
## [1] 2562.782

Wow. No movement whatsoever. Despite a pretty decent bivariate correlation between these variables (not shown, but cor.test provides a positive correlation of 0.39) there remains a LOT of variability within the clusters.
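Since we chose k = 3 arbitrarily, it is also worth mentioning the common "elbow" heuristic for picking k: run kmeans() for several values of k and watch how the total within-cluster sum-of-squares falls. A sketch with simulated data:

```r
# Within-cluster SS for k = 1 through 6 on simulated data
set.seed(6)
dat <- matrix(rnorm(200), ncol = 2)

wss <- sapply(1:6, function(k)
  kmeans(dat, centers = k, nstart = 20)$tot.withinss)

plot(1:6, wss, type = "b", xlab = "k",
     ylab = "Total within-cluster sum-of-squares")
```

You then look for the "elbow" in the curve - the value of k after which additional clusters stop reducing the within-cluster sum-of-squares by much.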

Adding Some Complexity to k-Means Clustering

The two variable example above was selected for its simplicity. In practice, we may use many different variables in the clustering model, but that can make the output a bit unwieldy (especially the plots!).

Let’s try a more complex model now with more variables. First, I want to select more variables from state_data, and I want to create a total crime rate variable:

state_data$ttl_crime <- with(state_data, rmurd+rrape+rrobb+raggr+
                               rburg+rlarc+rauto)
state_data_red <- subset(state_data, 
                         select=c("inc_rate", "poverty", "txsocl", "ttl_crime"))

Now, I want to scale the variables so it’s easier to read the subsequent plots. I can use the scale() function to accomplish this:

scaled_state_data <-as.data.frame(scale(state_data_red, 
                                        center=TRUE, scale=TRUE))

Before we move forward, let’s take a quick look at summaries from both data frames:

summary(state_data_red)
##     inc_rate        poverty         txsocl           ttl_crime   
##  Min.   :1.325   Min.   : 5.4   Min.   :  506258   Min.   :1946  
##  1st Qu.:3.192   1st Qu.:10.0   1st Qu.: 3277418   1st Qu.:2819  
##  Median :4.103   Median :12.2   Median : 7966611   Median :3470  
##  Mean   :4.271   Mean   :12.6   Mean   :11833840   Mean   :3545  
##  3rd Qu.:5.031   3rd Qu.:14.8   3rd Qu.:14998182   3rd Qu.:4191  
##  Max.   :8.856   Max.   :23.1   Max.   :75232408   Max.   :6404
summary(scaled_state_data)
##     inc_rate          poverty            txsocl          ttl_crime      
##  Min.   :-1.8111   Min.   :-2.1700   Min.   :-0.9245   Min.   :-1.7981  
##  1st Qu.:-0.6632   1st Qu.:-0.7827   1st Qu.:-0.6984   1st Qu.:-0.8173  
##  Median :-0.1033   Median :-0.1192   Median :-0.3156   Median :-0.0846  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4676   3rd Qu.: 0.6649   3rd Qu.: 0.2583   3rd Qu.: 0.7261  
##  Max.   : 2.8196   Max.   : 3.1681   Max.   : 5.1744   Max.   : 3.2146

We can confirm that the means in the scaled data frame are 0 - and if you also checked the standard deviations, they would all be exactly 1.

Now, let’s run the kmeans function again, specifying the same number of groups:

scaled_kmeans <-kmeans(scaled_state_data, 3, nstart=20)
scaled_kmeans
## K-means clustering with 3 clusters of sizes 217, 80, 340
## 
## Cluster means:
##      inc_rate    poverty     txsocl  ttl_crime
## 1  0.87206427  0.6763141 -0.2051128  0.8602698
## 2  0.08823375  0.4734082  2.1416834 -0.1232562
## 3 -0.57734308 -0.5430377 -0.3730153 -0.5200531
## 
## Clustering vector:
##  16  17  18  19  20  21  22  23  24  25  26  27  28  31  32  33  34  35  36  37 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
##  38  39  40  41  42  43  46  47  48  49  50  51  52  53  54  55  56  57  58  61 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   2 
##  62  63  64  65  66  67  68  69  70  71  72  73  76  77  78  79  80  81  82  83 
##   2   2   2   2   2   2   2   2   2   2   2   2   3   3   3   3   1   3   3   3 
##  84  85  86  87  88  91  92  93  94  95  96  97  98  99 100 101 102 103 106 107 
##   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   1   1 
## 108 109 110 111 112 113 114 115 116 117 118 121 122 123 124 125 126 127 128 129 
##   1   1   1   1   1   1   1   1   1   1   1   1   2   2   2   2   2   2   2   2 
## 130 131 132 133 136 137 138 139 140 141 142 143 144 145 146 147 148 151 152 153 
##   2   2   2   2   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
## 154 155 156 157 158 159 160 161 162 163 166 167 168 169 170 171 172 173 174 175 
##   1   1   3   3   3   1   3   3   3   3   3   3   3   3   3   3   3   3   3   3 
## 176 177 178 181 182 183 184 185 186 187 188 189 190 191 192 193 196 197 198 199 
##   3   3   3   3   3   3   3   3   3   3   2   2   2   2   2   2   3   3   3   3 
## 200 201 202 203 204 205 206 207 208 211 212 213 214 215 216 217 218 219 220 221 
##   3   3   3   1   1   1   1   1   3   3   3   3   3   3   3   3   3   3   3   3 
## 222 223 226 227 228 229 230 231 232 233 234 235 236 237 238 241 242 243 244 245 
##   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   1   3 
## 246 247 248 249 250 251 252 253 256 257 258 259 260 261 262 263 264 265 266 267 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
## 268 271 272 273 274 275 276 277 278 279 280 281 282 283 286 287 288 289 290 291 
##   1   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3 
## 292 293 294 295 296 297 298 301 302 303 304 305 306 307 308 309 310 311 312 313 
##   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3 
## 316 317 318 319 320 321 322 323 324 325 326 327 328 331 332 333 334 335 336 337 
##   3   1   1   1   1   1   2   2   2   2   2   2   2   3   3   3   3   3   3   3 
## 338 339 340 341 342 343 346 347 348 349 350 351 352 353 354 355 356 357 358 361 
##   3   3   3   3   3   3   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
## 362 363 364 365 366 367 368 369 370 371 372 373 376 377 378 379 380 381 382 383 
##   1   1   1   1   1   1   1   1   1   1   1   1   3   3   3   3   3   3   3   3 
## 384 385 386 387 388 391 392 393 394 395 396 397 398 399 400 401 402 403 406 407 
##   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   1 
## 408 409 410 411 412 413 414 415 416 417 418 421 422 423 424 425 426 427 428 429 
##   1   1   1   1   1   1   1   1   1   1   1   3   3   3   3   3   3   3   3   3 
## 430 431 432 433 436 437 438 439 440 441 442 443 444 445 446 447 448 451 452 453 
##   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   1   1   1 
## 454 455 456 457 458 459 460 461 462 463 466 467 468 469 470 471 472 473 474 475 
##   1   1   1   1   1   1   1   1   1   1   2   2   2   2   2   2   2   2   2   2 
## 476 477 478 481 482 483 484 485 486 487 488 489 490 491 492 493 496 497 498 499 
##   2   2   2   1   1   1   1   1   1   1   1   1   2   2   2   2   3   3   3   3 
## 500 501 502 503 504 505 506 507 508 511 512 513 514 515 516 517 518 519 520 521 
##   3   3   3   3   3   3   3   3   3   3   3   3   3   2   2   2   2   2   2   2 
## 522 523 526 527 528 529 530 531 532 533 534 535 536 537 538 541 542 543 544 545 
##   2   2   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
## 546 547 548 549 550 551 552 553 556 557 558 559 560 561 562 563 564 565 566 567 
##   3   3   3   3   3   3   3   3   3   3   3   3   2   2   2   2   2   2   2   2 
## 568 571 572 573 574 575 576 577 578 579 580 581 582 583 586 587 588 589 590 591 
##   2   3   3   3   3   3   3   3   3   3   3   3   3   3   1   1   1   1   1   1 
## 592 593 594 595 596 597 598 601 602 603 604 605 606 607 608 609 610 611 612 613 
##   1   1   1   1   1   1   1   3   3   3   3   3   3   3   3   3   3   3   3   3 
## 616 617 618 619 620 621 622 623 624 625 626 627 628 631 632 633 634 635 636 637 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   2 
## 638 639 640 641 642 643 646 647 648 649 650 651 652 653 654 655 656 657 658 661 
##   2   2   2   2   2   2   3   3   3   3   3   3   3   3   3   3   3   3   3   3 
## 662 663 664 665 666 667 668 669 670 671 672 673 676 677 678 679 680 681 682 683 
##   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3 
## 684 685 686 687 688 691 692 693 694 695 696 697 698 699 700 701 702 703 706 707 
##   3   3   3   3   3   3   3   1   1   3   3   3   3   3   3   3   3   3   3   3 
## 708 709 710 711 712 713 714 715 716 717 718 721 722 723 724 725 726 727 728 729 
##   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3 
## 730 731 732 733 736 737 738 739 740 741 742 743 744 745 746 747 748 
##   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3 
## 
## Within cluster sum of squares by cluster:
## [1] 612.2239 191.9039 566.2975
##  (between_SS / total_SS =  46.1 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"
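The `between_SS / total_SS` line in that printout measures how much of the total variation the cluster assignment accounts for, and it can be recomputed directly from the components listed above. A minimal sketch using R's built-in `USArrests` data as a stand-in (the lecture's `state_data` is not bundled with R):

```r
# Fit k-means on standardized data; set.seed makes the random starts reproducible
set.seed(1)
scaled_arrests <- scale(USArrests, center = TRUE, scale = TRUE)
km <- kmeans(scaled_arrests, centers = 3, nstart = 50)

# The printed percentage is just betweenss / totss ...
ratio <- km$betweenss / km$totss

# ... and total SS decomposes into between-cluster plus within-cluster SS
all.equal(km$totss, km$betweenss + km$tot.withinss)
```

Higher ratios mean the clusters are more separated relative to the overall spread of the data, though the ratio always rises as you add clusters, so it should not be maximized blindly.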

And now, let’s see what the new plot looks like:

plot(scaled_state_data, col=(scaled_kmeans$cluster +1),
     main="K-Means Clustering Results with K=3",
     pch=20, cex=2)

This produces a matrix of scatterplots, with each panel pairing two of the variables used by the clustering algorithm on its x- and y-axes.

As an example, let’s look at the plots in the third column. Here, the amount of US dollars spent on social programs runs along the x-axis, while the other variables take turns on the y-axis.

The groups are fairly distinct in these plots - states that spend more than average on social programs tend to have average values of total crime rates, poverty rates, and incarceration rates.

By contrast, states that spend less than average on social programs split into two groups - one with higher than average incarceration, poverty, and crime rates, and the other with average or below-average values on these rates.
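Interpretations read off a scatterplot matrix like this can be cross-checked numerically: the `centers` component of a `kmeans` fit gives each cluster's mean on each variable, in standard-deviation units when the inputs were scaled. A sketch, again with the built-in `USArrests` data standing in for `state_data`:

```r
set.seed(1)
scaled_arrests <- scale(USArrests, center = TRUE, scale = TRUE)
km <- kmeans(scaled_arrests, centers = 3, nstart = 50)

# Each row is one cluster's mean on each standardized variable;
# values above 0 are above the overall average, below 0 are below it
round(km$centers, 2)

# e.g., identify the cluster with the highest average murder rate
which.max(km$centers[, "Murder"])
```

Reading down a column of `km$centers` tells you the same story as scanning a column of the plot matrix, but without the squinting.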

As a final exercise, let’s let the model run wild a bit and give it all the scaled variables as inputs:

scaled_state_data <- as.data.frame(scale(state_data, 
                                         center=TRUE,
                                         scale=TRUE))
full_kmeans <- kmeans(scaled_state_data, 10, nstart=50)
full_kmeans
## K-means clustering with 10 clusters of sizes 61, 89, 29, 61, 99, 25, 16, 116, 73, 68
## 
## Cluster means:
##          rmurd       rrape      rrobb      raggr      rburg      rlarc
## 1  -0.05230550 -0.98935820  0.5819220 -0.2342269 -0.7895238 -0.8680263
## 2  -0.79308153 -0.32565851 -1.0802395 -0.7894424 -0.8191102 -0.5170396
## 3   0.33953830 -0.44077735  1.0927172  0.5791288  0.2752235  0.1935165
## 4   0.10686656 -0.18389337  0.3323008 -0.4017537  0.1595631 -0.3859843
## 5   1.24855361  0.26762568  0.9544124  1.1978524  1.1704297  1.0135120
## 6   0.05103608  2.39429837  0.4854701  1.4747585 -0.2258376  0.4015140
## 7   0.28978178 -1.03159025  1.1063799  0.1319453 -0.5599562 -1.0053496
## 8  -0.91413987  0.05301197 -1.0058211 -0.7091065 -0.8045125 -0.6343067
## 9   0.98300578  0.26004172  0.2368378  0.7203768  1.1128756  0.1592634
## 10 -0.55630874  0.26992801 -0.2390390 -0.5240909  0.2082819  1.2435678
##          rauto    poverty        gini       top_1  avgincome     cpi2014
## 1  -0.18367810 -0.8573686  0.15448123  0.71904574  1.6265718 -0.18307143
## 2  -0.64952201 -0.4479144 -0.35620945 -0.41568409 -0.6713955  0.84585031
## 3   0.05252460  0.7013164  1.55241374  1.63680824  0.2754916 -0.04766579
## 4  -0.36808728  0.4730869 -0.05422287 -0.29133700 -0.2270718 -0.75571812
## 5   1.19533694  0.3036193 -0.05159006 -0.02106673 -0.6651118  0.91292062
## 6  -0.01581519 -0.7923637 -0.92939959 -0.68197245  0.7504781  0.04966259
## 7   0.66318526  0.6837670  1.59402857  1.37647791  0.8855183 -0.50328768
## 8  -0.73719918 -0.2970494 -0.13130016 -0.10274527  0.4771617 -0.86379903
## 9  -0.12003647  1.4461589  0.72519680 -0.05451352 -0.5917393 -0.78315798
## 10  0.81861410 -0.7254995 -0.79858407 -0.34636146 -0.1888605  0.84074868
##          urate         gdp     netearn      uinsur     retire      avwage
## 1  -0.08208121  0.29685758  1.77219103  0.68626408  0.2089227  1.52687618
## 2  -0.85771344 -0.58444222 -0.73074144 -0.63549852 -0.6388676 -1.11866607
## 3   0.08476802  2.22530868  0.08996750 -0.33656745  0.1721814  0.71379718
## 4   1.28487499  0.44562699  0.01546334  1.12260322  0.8040834  0.48173482
## 5  -0.35944824 -0.08537633 -0.71350104 -0.63884337 -0.8536183 -0.65485334
## 6   0.07670064 -0.66841541  0.81750337  0.13548571  0.1898636  0.69191142
## 7   1.05185201  4.25446807  1.01153202  0.82647703  0.5072177  1.87401815
## 8  -0.03912444 -0.50429802  0.42800990  0.25890360  0.6295593  0.11841399
## 9   0.81698994 -0.29751014 -0.51742977  0.01972231  0.8306704 -0.04531131
## 10 -0.55523234 -0.30150279 -0.36003316 -0.42439574 -1.0580750 -0.53741179
##         txsocl     txwork     txmdcr     txmdcd      txsupp      txeitc
## 1   0.19629515 -0.1897565  0.2099301  0.1493175 -0.02874765 -0.11266938
## 2  -0.64886546 -0.3525371 -0.6227051 -0.5413621 -0.46247487 -0.59832013
## 3   2.33584419  1.2833909  2.4065866  2.1556542  1.82671867  2.67451867
## 4   0.89391207  0.4746795  0.7601900  0.4797826  0.42292587  0.55254295
## 5  -0.08234482 -0.2942804 -0.1539651 -0.1504991 -0.11044931  0.02130350
## 6  -0.85523746 -0.5096292 -0.7529021 -0.6270696 -0.56665172 -0.70337211
## 7   3.48172159  3.8986422  3.6899873  4.3640520  4.83716353  3.37378613
## 8  -0.55930405 -0.3548428 -0.5148294 -0.4735798 -0.45619139 -0.53317247
## 9  -0.14861287 -0.3190648 -0.1170851 -0.1655547 -0.13558250  0.02774262
## 10 -0.39616998  0.3048069 -0.4449246 -0.3667028 -0.37261129 -0.43860827
##         txsnap     txothr     txuins    inc_rate  ttl_crime
## 1  -0.08793903  0.2532314  0.3348039 -0.36723729 -0.7321866
## 2  -0.61705398 -0.5442316 -0.5005347 -0.61305524 -0.8079377
## 3   2.10864380  1.9209959  0.9514849  0.63699609  0.3329894
## 4   0.99754602  0.5618700  0.9140458 -0.05595911 -0.2797457
## 5  -0.21696051 -0.2507025 -0.2710354  0.59478024  1.3280851
## 6  -0.66097184 -0.5986402 -0.5423314  2.08863387  0.4283555
## 7   3.29563314  4.6424529  3.9656181 -0.20813303 -0.5035142
## 8  -0.45837802 -0.4155248 -0.3682495 -0.63416225 -0.8653442
## 9   0.12261909 -0.2583676 -0.2333826  0.93340153  0.4761188
## 10 -0.47389675 -0.3591976 -0.3313313 -0.59471237  0.8157029
## 
## Clustering vector:
##  16  17  18  19  20  21  22  23  24  25  26  27  28  31  32  33  34  35  36  37 
##   6   6   6   6   6   6   6   6   6   6   6   6   6   5   5   5   5   5   5   5 
##  38  39  40  41  42  43  46  47  48  49  50  51  52  53  54  55  56  57  58  61 
##   9   9   9   9   9   9   5   5   5   5   5   9   9   9   9   9   9   9   9   3 
##  62  63  64  65  66  67  68  69  70  71  72  73  76  77  78  79  80  81  82  83 
##   3   7   7   7   7   7   7   7   7   7   7   7  10  10  10  10  10  10   8   8 
##  84  85  86  87  88  91  92  93  94  95  96  97  98  99 100 101 102 103 106 107 
##   8   8   8   8   8   1   1   1   1   1   1   1   1   1   1   1   1   1   6   6 
## 108 109 110 111 112 113 114 115 116 117 118 121 122 123 124 125 126 127 128 129 
##   6   6   6   6   6   6   6   6   6   9   6   5   5   5   5   3   3   3   3   3 
## 130 131 132 133 136 137 138 139 140 141 142 143 144 145 146 147 148 151 152 153 
##   3   3   3   3   5   5   5   5   5   5   5   9   4   4   4   4   4  10  10  10 
## 154 155 156 157 158 159 160 161 162 163 166 167 168 169 170 171 172 173 174 175 
##  10  10  10  10  10   8   8   8   8   8   2   2   2   2   2   2   2   8   8   8 
## 176 177 178 181 182 183 184 185 186 187 188 189 190 191 192 193 196 197 198 199 
##   8   8   8   5   5   5   5   1   1   1   1   4   4   4   4   4  10  10  10  10 
## 200 201 202 203 204 205 206 207 208 211 212 213 214 215 216 217 218 219 220 221 
##  10  10  10   9   4   4   4   4   4   2   2   2   2   2   2   2   8   8   8   8 
## 222 223 226 227 228 229 230 231 232 233 234 235 236 237 238 241 242 243 244 245 
##   8   8  10  10  10  10  10  10  10   8   8   8   8   8   8   2   2   2   2   2 
## 246 247 248 249 250 251 252 253 256 257 258 259 260 261 262 263 264 265 266 267 
##   2   9   9   4   4   4   4   4   5   5   5   5   9   5   5   9   9   9   9   9 
## 268 271 272 273 274 275 276 277 278 279 280 281 282 283 286 287 288 289 290 291 
##   9   2   2   2   2   2   2   2   8   8   8   8   8   8   5   5   5   5   5   1 
## 292 293 294 295 296 297 298 301 302 303 304 305 306 307 308 309 310 311 312 313 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
## 316 317 318 319 320 321 322 323 324 325 326 327 328 331 332 333 334 335 336 337 
##   5   5   5   5   5   5   4   4   4   4   4   4   4  10  10  10  10  10   8   8 
## 338 339 340 341 342 343 346 347 348 349 350 351 352 353 354 355 356 357 358 361 
##   8   8   8   8   8   8   5   5   5   9   9   9   9   9   9   9   9   9   9   5 
## 362 363 364 365 366 367 368 369 370 371 372 373 376 377 378 379 380 381 382 383 
##   5   5   5   5   5   5   9   9   9   9   9   9   2   2   2   2   2   2   2   2 
## 384 385 386 387 388 391 392 393 394 395 396 397 398 399 400 401 402 403 406 407 
##   8   8   8   8   8  10  10  10  10  10   2   2   8   8   8   8   8   8   5   5 
## 408 409 410 411 412 413 414 415 416 417 418 421 422 423 424 425 426 427 428 429 
##   5   5   5   5   5   5   9   9   9   9   9   2   2   2   2   2   8   8   8   8 
## 430 431 432 433 436 437 438 439 440 441 442 443 444 445 446 447 448 451 452 453 
##   8   8   8   8   1   1   1   1   1   1   1   1   1   1   1   1   1   5   5   5 
## 454 455 456 457 458 459 460 461 462 463 466 467 468 469 470 471 472 473 474 475 
##   5   5   5   5   9   9   9   9   9   9   3   3   3   3   3   3   3   3   7   7 
## 476 477 478 481 482 483 484 485 486 487 488 489 490 491 492 493 496 497 498 499 
##   7   7   7   5   5   5   5   5   5   5   9   4   4   4   4   4   2   2   2   2 
## 500 501 502 503 504 505 506 507 508 511 512 513 514 515 516 517 518 519 520 521 
##   2   2   2   8   8   8   8   8   8  10  10  10  10  10   4   4   4   4   4   4 
## 522 523 526 527 528 529 530 531 532 533 534 535 536 537 538 541 542 543 544 545 
##   4   4   5   5   5   5   5   5   9   9   9   9   9   9   9  10  10  10  10  10 
## 546 547 548 549 550 551 552 553 556 557 558 559 560 561 562 563 564 565 566 567 
##  10  10   8   4   4   4   8   8   2   4   4   4   4   4   1   4   4   4   4   4 
## 568 571 572 573 574 575 576 577 578 579 580 581 582 583 586 587 588 589 590 591 
##   4  10  10   2   2   2   8   8   8   8   8   8   8   8   5   5   5   5   5   5 
## 592 593 594 595 596 597 598 601 602 603 604 605 606 607 608 609 610 611 612 613 
##   5   9   9   9   9   9   9   2   2   2   2   2   2   2   8   8   8   8   8   8 
## 616 617 618 619 620 621 622 623 624 625 626 627 628 631 632 633 634 635 636 637 
##   5   5   5   5   5   5   5   9   9   9   9   9   9   5   5   5   3   3   3   3 
## 638 639 640 641 642 643 646 647 648 649 650 651 652 653 654 655 656 657 658 661 
##   3   3   3   3   3   3  10  10  10  10  10  10  10  10   8   8   8   8   8   2 
## 662 663 664 665 666 667 668 669 670 671 672 673 676 677 678 679 680 681 682 683 
##   2   2   2   2   2   8   8   8   8   8   8   8   2   2   2   2   1   1   1   1 
## 684 685 686 687 688 691 692 693 694 695 696 697 698 699 700 701 702 703 706 707 
##   1   1   1   1   1  10  10  10  10  10  10  10  10   4   4   4   4   4   2   2 
## 708 709 710 711 712 713 714 715 716 717 718 721 722 723 724 725 726 727 728 729 
##   2   2   2   2   2   2   8   8   8   8   8   2   2   2   2   2   2   8   8   4 
## 730 731 732 733 736 737 738 739 740 741 742 743 744 745 746 747 748 
##   4   8   8   8   2   2   2   2   2   8   8   8   8   8   8   8   8 
## 
## Within cluster sum of squares by cluster:
##  [1]  821.1876  702.7311  633.4840  763.9324 1220.3068  260.1605  463.8424
##  [8] 1042.5231  744.5311  564.6194
##  (between_SS / total_SS =  60.9 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"
plot(scaled_state_data[,c(8,14,19,28,29)], col=(full_kmeans$cluster +1),
     main="Model Ran Wild, K=10",
     pch=20, cex=2)

Final Notes

There are also methods for choosing the number of clusters from the data itself, as well as methods (like hierarchical clustering via the hclust() function) that do not require you to specify the number of clusters before fitting the model.
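For instance, a common heuristic for choosing k is the "elbow" plot: fit k-means over a range of k and look for the point where the total within-cluster sum of squares stops dropping sharply. And hclust() builds a full dendrogram first, so you can cut it into any number of groups after the fact. A quick sketch of both, using the built-in `USArrests` data as a stand-in:

```r
set.seed(1)
scaled_arrests <- scale(USArrests, center = TRUE, scale = TRUE)

# Elbow heuristic: total within-cluster SS for k = 1..10
wss <- sapply(1:10, function(k) {
  kmeans(scaled_arrests, centers = k, nstart = 50)$tot.withinss
})
plot(1:10, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster SS")

# Hierarchical clustering: no k needed up front
hc <- hclust(dist(scaled_arrests), method = "complete")
plot(hc)                     # dendrogram of the full merge history
groups <- cutree(hc, k = 3)  # choose the number of groups afterward
table(groups)
```

The kink in the elbow plot and the long vertical branches in the dendrogram are both informal guides, not tests; they suggest a number of clusters rather than prove one.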

I don’t have time to explore these further today, but this was a rudimentary introduction to two different types of unsupervised learning techniques that try to summarize homogeneity in your data.

Two Questions

What are your two questions?

Time to Go Home